In this assignment, I will be exploring, analyzing, and modeling the Moneyball dataset. Each record represents a professional baseball team from the years 1871 to 2006 inclusive, and contains the team's performance for the given year, with all statistics adjusted to a 162-game season. The purpose of this assignment is to build a multiple linear regression model on the training data to predict the number of wins for the team.
Variables:
INDEX Identification Variable (do not use)
TARGET_WINS Number of wins
TEAM_BATTING_H Base Hits by batters (1B,2B,3B,HR)
TEAM_BATTING_2B Doubles by batters (2B)
TEAM_BATTING_3B Triples by batters (3B)
TEAM_BATTING_HR Homeruns by batters (4B)
TEAM_BATTING_BB Walks by batters
TEAM_BATTING_HBP Batters hit by pitch (get a free base)
TEAM_BATTING_SO Strikeouts by batters
TEAM_BASERUN_SB Stolen bases
TEAM_BASERUN_CS Caught stealing
TEAM_FIELDING_E Errors
TEAM_FIELDING_DP Double Plays
TEAM_PITCHING_BB Walks allowed
TEAM_PITCHING_H Hits allowed
TEAM_PITCHING_HR Homeruns allowed
TEAM_PITCHING_SO Strikeouts by pitchers
# Loading library
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)
library(DataExplorer)
library(mice)
##
## Attaching package: 'mice'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## cbind, rbind
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(corrplot)
## corrplot 0.92 loaded
library(reshape)
##
## Attaching package: 'reshape'
##
## The following object is masked from 'package:lubridate':
##
## stamp
##
## The following object is masked from 'package:dplyr':
##
## rename
##
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
library(reshape2)
##
## Attaching package: 'reshape2'
##
## The following objects are masked from 'package:reshape':
##
## colsplit, melt, recast
##
## The following object is masked from 'package:tidyr':
##
## smiths
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
##
## The following object is masked from 'package:purrr':
##
## lift
library(dplyr)
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(caret) # for data splitting and pre-processing
library(stats)
We have two data sets:
- a training set, where most of the analysis will be done
- an evaluation set, which will be used to evaluate the model
# load data money ball
# Evaluation set, used as the test set
money_ball_eval <- read.csv('moneyball-evaluation-data.csv')
# training set
money_ball_train <- read.csv('moneyball-training-data.csv')
str(money_ball_train)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
introduce(money_ball_train)
## rows columns discrete_columns continuous_columns all_missing_columns
## 1 2276 17 0 17 0
## total_missing_values complete_rows total_observations memory_usage
## 1 3478 191 38692 159280
# Data Description
money_ball_train %>%
summary() %>%
kable() %>% kable_styling() %>% kable_classic(full_width = F, html_font = "Cambria")
| INDEX | TARGET_WINS | TEAM_BATTING_H | TEAM_BATTING_2B | TEAM_BATTING_3B | TEAM_BATTING_HR | TEAM_BATTING_BB | TEAM_BATTING_SO | TEAM_BASERUN_SB | TEAM_BASERUN_CS | TEAM_BATTING_HBP | TEAM_PITCHING_H | TEAM_PITCHING_HR | TEAM_PITCHING_BB | TEAM_PITCHING_SO | TEAM_FIELDING_E | TEAM_FIELDING_DP | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Min. : 1.0 | Min. : 0.00 | Min. : 891 | Min. : 69.0 | Min. : 0.00 | Min. : 0.00 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. :29.00 | Min. : 1137 | Min. : 0.0 | Min. : 0.0 | Min. : 0.0 | Min. : 65.0 | Min. : 52.0 | |
| 1st Qu.: 630.8 | 1st Qu.: 71.00 | 1st Qu.:1383 | 1st Qu.:208.0 | 1st Qu.: 34.00 | 1st Qu.: 42.00 | 1st Qu.:451.0 | 1st Qu.: 548.0 | 1st Qu.: 66.0 | 1st Qu.: 38.0 | 1st Qu.:50.50 | 1st Qu.: 1419 | 1st Qu.: 50.0 | 1st Qu.: 476.0 | 1st Qu.: 615.0 | 1st Qu.: 127.0 | 1st Qu.:131.0 | |
| Median :1270.5 | Median : 82.00 | Median :1454 | Median :238.0 | Median : 47.00 | Median :102.00 | Median :512.0 | Median : 750.0 | Median :101.0 | Median : 49.0 | Median :58.00 | Median : 1518 | Median :107.0 | Median : 536.5 | Median : 813.5 | Median : 159.0 | Median :149.0 | |
| Mean :1268.5 | Mean : 80.79 | Mean :1469 | Mean :241.2 | Mean : 55.25 | Mean : 99.61 | Mean :501.6 | Mean : 735.6 | Mean :124.8 | Mean : 52.8 | Mean :59.36 | Mean : 1779 | Mean :105.7 | Mean : 553.0 | Mean : 817.7 | Mean : 246.5 | Mean :146.4 | |
| 3rd Qu.:1915.5 | 3rd Qu.: 92.00 | 3rd Qu.:1537 | 3rd Qu.:273.0 | 3rd Qu.: 72.00 | 3rd Qu.:147.00 | 3rd Qu.:580.0 | 3rd Qu.: 930.0 | 3rd Qu.:156.0 | 3rd Qu.: 62.0 | 3rd Qu.:67.00 | 3rd Qu.: 1682 | 3rd Qu.:150.0 | 3rd Qu.: 611.0 | 3rd Qu.: 968.0 | 3rd Qu.: 249.2 | 3rd Qu.:164.0 | |
| Max. :2535.0 | Max. :146.00 | Max. :2554 | Max. :458.0 | Max. :223.00 | Max. :264.00 | Max. :878.0 | Max. :1399.0 | Max. :697.0 | Max. :201.0 | Max. :95.00 | Max. :30132 | Max. :343.0 | Max. :3645.0 | Max. :19278.0 | Max. :1898.0 | Max. :228.0 | |
| NA | NA | NA | NA | NA | NA | NA | NA’s :102 | NA’s :131 | NA’s :772 | NA’s :2085 | NA | NA | NA | NA’s :102 | NA | NA’s :286 |
str(money_ball_train)
## 'data.frame': 2276 obs. of 17 variables:
## $ INDEX : int 1 2 3 4 5 6 7 8 11 12 ...
## $ TARGET_WINS : int 39 70 86 70 82 75 80 85 86 76 ...
## $ TEAM_BATTING_H : int 1445 1339 1377 1387 1297 1279 1244 1273 1391 1271 ...
## $ TEAM_BATTING_2B : int 194 219 232 209 186 200 179 171 197 213 ...
## $ TEAM_BATTING_3B : int 39 22 35 38 27 36 54 37 40 18 ...
## $ TEAM_BATTING_HR : int 13 190 137 96 102 92 122 115 114 96 ...
## $ TEAM_BATTING_BB : int 143 685 602 451 472 443 525 456 447 441 ...
## $ TEAM_BATTING_SO : int 842 1075 917 922 920 973 1062 1027 922 827 ...
## $ TEAM_BASERUN_SB : int NA 37 46 43 49 107 80 40 69 72 ...
## $ TEAM_BASERUN_CS : int NA 28 27 30 39 59 54 36 27 34 ...
## $ TEAM_BATTING_HBP: int NA NA NA NA NA NA NA NA NA NA ...
## $ TEAM_PITCHING_H : int 9364 1347 1377 1396 1297 1279 1244 1281 1391 1271 ...
## $ TEAM_PITCHING_HR: int 84 191 137 97 102 92 122 116 114 96 ...
## $ TEAM_PITCHING_BB: int 927 689 602 454 472 443 525 459 447 441 ...
## $ TEAM_PITCHING_SO: int 5456 1082 917 928 920 973 1062 1033 922 827 ...
## $ TEAM_FIELDING_E : int 1011 193 175 164 138 123 136 112 127 131 ...
## $ TEAM_FIELDING_DP: int NA 155 153 156 168 149 186 136 169 159 ...
We are going to check the distribution of every column and the outliers present in the data set. The chart below shows the percentage of missing values in the data set. We notice that TEAM_BATTING_HBP has the highest number of missing values, but its mean and median fall in the same range, which suggests the distribution is centered rather than heavily skewed.
# Correlation Plot
cor_matrix <- cor(money_ball_train, use = 'complete.obs')
corrplot(cor_matrix, method = 'circle')
money <- money_ball_train
# Create missing data flags
# Create missing data flags
#missing_flag <- ifelse(is.na(money_ball_train$TEAM_BATTING_HBP), 1, 2)
#money_ball_train$missing_flag <- missing_flag
par(mfrow=c(3,3))
# Create distribution plot to check outliers
for (i in 1:17) {
hist(money[,i],main=names(money[i]),xlab=names(money[i]),breaks = 51)
boxplot(money[,i], main=names(money[i]), type="l",horizontal = TRUE)
plot(money[,i], money$TARGET_WINS, main = names(money[i]), xlab=names(money[i]))
abline(lm(money$TARGET_WINS ~ money[,i], data = money), col = "blue")
}
The relationship between the target variable and the predictors is not particularly strong, but there are statistically significant relationships with certain variables. The calculate_correlations_with_pvalues function computes the correlation coefficients and p-values between a given target variable and all other predictor variables in a dataset. It first validates the input to ensure that the target column name is a single string and exists in the data, then removes rows with missing values. For each predictor, the function performs a correlation test using cor.test(), which returns both the correlation coefficient and the corresponding p-value. These results are stored in a data frame, with both the correlation and p-value rounded to 10 decimal places. This gives a clear, organized output, making it easy to evaluate the strength and statistical significance of the relationships between the target variable and the predictors.
calculate_correlations_with_pvalues <- function(data, target_col) {
# Check if target_col is a character string
if (!is.character(target_col) || length(target_col) != 1) {
stop("target_col must be a single character string.")
}
# Ensure the target column exists in the data
if (!target_col %in% names(data)) {
stop("Target column not found in the dataframe.")
}
# Remove rows with missing values
data_complete <- data[complete.cases(data), ]
# Initialize a results data frame
results <- data.frame(Predictor = character(), Correlation = numeric(), PValue = numeric(), stringsAsFactors = FALSE)
# Loop through each predictor variable
for (predictor in names(data_complete)[names(data_complete) != target_col]) {
# Perform the correlation test
test_result <- cor.test(data_complete[[predictor]], data_complete[[target_col]])
# Store the rounded results to 10 decimal places
results <- rbind(results, data.frame(Predictor = predictor,
Correlation = round(test_result$estimate, 10),
PValue = round(test_result$p.value, 10)))
}
return(results)
}
# Example usage
correlation_results <- calculate_correlations_with_pvalues(money_ball_train, "TARGET_WINS")
# View the results
print(correlation_results)
## Predictor Correlation PValue
## cor INDEX -0.04895047 0.5012836479
## cor1 TEAM_BATTING_H 0.46994665 0.0000000000
## cor2 TEAM_BATTING_2B 0.31298400 0.0000104138
## cor3 TEAM_BATTING_3B -0.12434586 0.0865523694
## cor4 TEAM_BATTING_HR 0.42241683 0.0000000012
## cor5 TEAM_BATTING_BB 0.46868793 0.0000000000
## cor6 TEAM_BATTING_SO -0.22889273 0.0014476066
## cor7 TEAM_BASERUN_SB 0.01483639 0.8385814774
## cor8 TEAM_BASERUN_CS -0.17875598 0.0133534492
## cor9 TEAM_BATTING_HBP 0.07350424 0.3122327101
## cor10 TEAM_PITCHING_H 0.47123431 0.0000000000
## cor11 TEAM_PITCHING_HR 0.42246683 0.0000000011
## cor12 TEAM_PITCHING_BB 0.46839882 0.0000000000
## cor13 TEAM_PITCHING_SO -0.22936481 0.0014142043
## cor14 TEAM_FIELDING_E -0.38668800 0.0000000329
## cor15 TEAM_FIELDING_DP -0.19586601 0.0066174528
Now let's dive into the missing values.
#
plot_intro(money_ball_train, title = 'Missing Information on Money Ball Dataset',
ggtheme = theme_minimal())
# Plot missing volume in Column
plot_missing(money_ball_train,title = 'Information about Missing Value in money ball dataset',ggtheme = theme_minimal())
We have discovered that some observations are missing data; these columns will need imputation after analysis. About 9% of the values are missing, spread out as follows:
> TEAM_BATTING_SO 4.48%
> TEAM_PITCHING_SO 4.48%
> TEAM_BASERUN_SB 5.76%
> TEAM_BASERUN_CS 12.57%
> TEAM_BATTING_HBP 91.61%
Why Use Predictive Imputation?
Preserves Relationships: Predictive imputation uses the relationships between variables to estimate missing values, which can lead to more accurate and reliable data sets.
Handles Different Types of Missing Data: It can handle data that is Missing Completely at Random (MCAR), Missing at Random (MAR), and even some cases of Missing Not at Random (MNAR). Maintains Variability: Unlike simple imputation methods (mean, median), predictive imputation maintains the natural variability in the data.
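To make the variability point concrete, here is a minimal base-R sketch (on a made-up vector, not the moneyball data) contrasting mean imputation with a donor-based scheme in the spirit of PMM:

```r
set.seed(42)
# Hypothetical example: a numeric vector with some values missing
x <- c(rnorm(100, mean = 50, sd = 10), rep(NA, 20))

# Mean imputation: every missing entry gets the same value,
# which shrinks the variance of the imputed vector
x_mean <- x
x_mean[is.na(x_mean)] <- mean(x, na.rm = TRUE)

# Donor-based imputation (PMM-like in spirit): each missing entry is
# filled with a value sampled from the observed data, so the natural
# spread of the variable is preserved
x_donor <- x
x_donor[is.na(x_donor)] <- sample(x[!is.na(x)], sum(is.na(x)), replace = TRUE)

var(x_mean)   # deflated relative to the observed variance
var(x_donor)  # close to the observed variance
```

With mean imputation every gap receives the identical value, so the variance is artificially deflated; sampling donors from the observed values keeps the spread close to the original, which is the behavior mice's pmm method aims for (using predicted-value matching rather than plain random donors).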
# re-assign moneyball
money <- money_ball_train
# Create a data set with predictive imputation (PMM)
# Compute multiple imputation
data1 <- mice(money, method = 'pmm', m=5)
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## Warning: Number of logged events: 25
completed_data <- complete(data1)
money_ball_eval <- mice(money_ball_eval, method = 'pmm', m=5)
##
## iter imp variable
## 1 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 1 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 2 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 3 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 4 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 1 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 2 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 3 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 4 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## 5 5 TEAM_BATTING_SO TEAM_BASERUN_SB TEAM_BASERUN_CS TEAM_BATTING_HBP TEAM_PITCHING_SO TEAM_FIELDING_DP
## Warning: Number of logged events: 50
money_ball_eval <- complete(money_ball_eval)
# View missing value distribution
plot_missing(completed_data)
plot_scatterplot(money[,-c(1,11)],by="TARGET_WINS", nrow = 1L, ncol = 2L)
## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 903 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 102 rows containing missing values or values outside the scale range
## (`geom_point()`).
## Warning: Removed 286 rows containing missing values or values outside the scale range
## (`geom_point()`).
# after data transformation
par(mfrow=c(3,3))
for (i in 1:17) {
hist(completed_data[,i],main=names(completed_data[i]),xlab=names(completed_data[i]),breaks = 51)
boxplot(completed_data[,i], main=names(completed_data[i]), type="l",horizontal = TRUE)
plot(completed_data[,i], completed_data$TARGET_WINS, main = names(completed_data[i]), xlab=names(completed_data[i]))
abline(lm(completed_data$TARGET_WINS ~ completed_data[,i], data = completed_data), col = "blue")
}
The remove_outliers_df function removes outliers from all numeric columns in a dataset using the interquartile range (IQR) method. It loops through each column, checks whether it is numeric, and calculates the first (Q1) and third (Q3) quartiles to determine the IQR. It then sets the lower and upper bounds as Q1 - 1.5 * IQR and Q3 + 1.5 * IQR, and removes any rows where the values in a numeric column fall outside these bounds. Non-numeric columns are ignored, so only numeric data is affected. The result is a dataset with outliers removed from all numeric columns.
remove_outliers_df <- function(df) {
# Loop through each numeric column in the data frame
for (col in names(df)) {
if (is.numeric(df[[col]])) {
# Calculate Q1, Q3, and IQR for the column
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
# Set lower and upper bounds
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
# Remove rows where the value in this column is an outlier
df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
}
}
return(df)
}
# Remove the outlier in completed dataset
money_ball <- remove_outliers_df(completed_data)
# Visualize the distribution
par(mfrow=c(3,3))
for (i in 1:17) {
hist(money_ball[,i],main=names(money_ball[i]),xlab=names(money_ball[i]),breaks = 51)
boxplot(money_ball[,i], main=names(money_ball[i]), type="l",horizontal = TRUE)
plot(money_ball[,i], money_ball$TARGET_WINS, main = names(money_ball[i]), xlab=names(money_ball[i]))
abline(lm(money_ball$TARGET_WINS ~ money_ball[,i], data = money_ball), col = "blue")
}
# see how the density plot shows distribution
melt(money_ball) %>% ggplot(aes(x= value)) +
geom_density(fill='red') + facet_wrap(~variable, scales = 'free')
## No id variables; using all as measure variables
The density plot shows how the values of each variable are distributed across their range. Peaks in the density curve indicate where values are concentrated, while troughs represent areas with fewer observations. After removing outliers, let's see if the correlations among the variables improved.
# Example usage
corre_updated_results <- calculate_correlations_with_pvalues(money_ball, "TARGET_WINS")
# View the results
print(corre_updated_results)
## Predictor Correlation PValue
## cor INDEX 0.02716337 0.2819341234
## cor1 TEAM_BATTING_H 0.34303778 0.0000000000
## cor2 TEAM_BATTING_2B 0.19432525 0.0000000000
## cor3 TEAM_BATTING_3B 0.13103793 0.0000001867
## cor4 TEAM_BATTING_HR 0.22028779 0.0000000000
## cor5 TEAM_BATTING_BB 0.29280036 0.0000000000
## cor6 TEAM_BATTING_SO -0.06294659 0.0125804516
## cor7 TEAM_BASERUN_SB 0.16624738 0.0000000000
## cor8 TEAM_BASERUN_CS 0.03795936 0.1326082561
## cor9 TEAM_BATTING_HBP 0.01612192 0.5231229119
## cor10 TEAM_PITCHING_H 0.28084171 0.0000000000
## cor11 TEAM_PITCHING_HR 0.22399974 0.0000000000
## cor12 TEAM_PITCHING_BB 0.28313894 0.0000000000
## cor13 TEAM_PITCHING_SO -0.06898842 0.0062286091
## cor14 TEAM_FIELDING_E -0.18279144 0.0000000000
## cor15 TEAM_FIELDING_DP -0.09134064 0.0002888929
P-values improved for most of the variables, but the correlations did not. As noted earlier, the correlations before the transformation also showed no sign of a strong relationship between the predictors and the target variable.
We are going to build four different models and assess them based on residual analysis. The residual chart contains four diagnostic plots from a multiple linear regression analysis: Residuals vs Fitted (checks linearity), Normal Q-Q (checks normality of residuals), Scale-Location (checks homoscedasticity), and Residuals vs Leverage (flags influential observations). These plots help assess the validity of the model by checking assumptions and identifying potential issues.
# Initiate the model
# Note: all predictors must be joined with '+'; a comma after
# TEAM_BATTING_2B would be parsed as lm()'s 'subset' argument
model <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB,
data = money_ball)
# Summarize the model
summary(model)
##
## Call:
## lm(formula = TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B,
## data = money_ball, subset = TEAM_BATTING_3B + TEAM_BATTING_HR +
## TEAM_BATTING_BB + TEAM_BATTING_SO + TEAM_BASERUN_SB)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.641 -7.905 0.688 7.764 37.085
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.299452 6.381482 -0.517 0.60530
## TEAM_BATTING_H 0.065262 0.005339 12.223 < 2e-16 ***
## TEAM_BATTING_2B -0.040766 0.013838 -2.946 0.00333 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.71 on 659 degrees of freedom
## (909 observations deleted due to missingness)
## Multiple R-squared: 0.2121, Adjusted R-squared: 0.2097
## F-statistic: 88.68 on 2 and 659 DF, p-value: < 2.2e-16
# Model 1 evaluation
mse <- mean(model$residuals^2)
r_squared <- summary(model)$r.squared
f_stat <- summary(model)$fstatistic[1]
print(paste("MSE:", mse, "R-squared:", r_squared, "F-statistic:", f_stat))
## [1] "MSE: 136.489072082661 R-squared: 0.212070871069777 F-statistic: 88.6848187886701"
predictions <- predict(model, newdata = money_ball_eval)
head(predictions)
## 1 2 3 4 5 6
## 68.67246 70.23015 80.28129 84.54257 82.72908 80.47014
par(mfrow=c(2,2))
plot(model)
The model explains 28.39% of the variance in the dependent variable, as indicated by the R-squared. The adjusted R-squared, at 28.07%, confirms that the predictors in the model are meaningful. The F-statistic (89.98) is large, and the p-value (< 2.2e-16) is very small, suggesting that the model is statistically significant overall, meaning that at least one predictor significantly contributes to explaining the dependent variable.
# Ensure there are no missing values in both predictors and the target
df_complete <- money_ball[complete.cases(money_ball), ]
# Separate the predictors and target (assuming 'target' is the target column)
df_predictors <- df_complete[, -which(names(df_complete) == "TARGET_WINS")]
target <- df_complete$TARGET_WINS # Target variable
# Standardize the predictors (without the target column)
df_standardized <- scale(df_predictors)
# Perform PCA
pca_model <- prcomp(df_standardized, center = TRUE, scale. = TRUE)
# Create a data frame from the principal components
df_pca <- as.data.frame(pca_model$x)
# Ensure that PCA dataframe has the same number of rows as the original data
# Add target variable to the PCA dataframe
df_pca$TARGET_WINS <- target
# Fit linear regression model using the first few principal components
model1 <- lm(TARGET_WINS ~ PC1 + PC2 + PC3 + PC4+ PC5 + PC6+ PC7 + PC8, data = df_pca)
# View the summary of the model
summary(model1)
##
## Call:
## lm(formula = TARGET_WINS ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 +
## PC7 + PC8, data = df_pca)
##
## Residuals:
## Min 1Q Median 3Q Max
## -40.124 -7.850 0.248 7.811 34.784
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 80.96372 0.28241 286.690 < 2e-16 ***
## PC1 0.16242 0.12452 1.304 0.19229
## PC2 2.67804 0.16209 16.522 < 2e-16 ***
## PC3 0.55381 0.20967 2.641 0.00834 **
## PC4 2.64107 0.21846 12.089 < 2e-16 ***
## PC5 0.05383 0.28799 0.187 0.85175
## PC6 0.91223 0.29534 3.089 0.00205 **
## PC7 -1.02033 0.32214 -3.167 0.00157 **
## PC8 -3.13575 0.40248 -7.791 1.2e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.19 on 1562 degrees of freedom
## Multiple R-squared: 0.2455, Adjusted R-squared: 0.2416
## F-statistic: 63.51 on 8 and 1562 DF, p-value: < 2.2e-16
mse1 <- mean(model1$residuals^2)
r_squared1 <- summary(model1)$r.squared
f_stat1 <- summary(model1)$fstatistic[1]
print(paste("MSE:", mse1, "R-squared:", r_squared1, "F-statistic:", f_stat1))
## [1] "MSE: 124.577029555811 R-squared: 0.245453244052163 F-statistic: 63.5146139366583"
# Project the evaluation predictors onto the training principal components,
# standardizing with the training means/SDs, then predict wins with model1
eval_std <- scale(money_ball_eval[, colnames(df_predictors)],
center = attr(df_standardized, "scaled:center"),
scale = attr(df_standardized, "scaled:scale"))
eval_pcs <- as.data.frame(predict(pca_model, newdata = eval_std))
predictions1 <- predict(model1, newdata = eval_pcs)
head(predictions1)
par(mfrow=c(2,2))
plot(model1)
MODEL 2 Analysis
The residuals (differences between observed and predicted values) range from a minimum of -40.124 to a maximum of 34.784. The distribution of residuals, with a median close to 0 (0.248), suggests that the model has a reasonable balance of over- and under-predictions. The first quartile (1Q) is -7.850, and the third quartile (3Q) is 7.811, indicating that half of the residuals lie between these values, meaning that most predictions deviate from the actual values by around ±8 units.
Overall, the regression model is statistically significant, but it explains only about 24.55% of the variance in the dependent variable, indicating that other variables not included in the model might be influencing the outcome. Among the predictors, PC2, PC4, PC7, and PC8 show strong significant effects, while PC1 and PC5 are not significant contributors to the model. The residuals appear moderately dispersed around the predicted values, and the significant predictors provide meaningful insight into the relationships within the data.
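As a side note, the number of components to keep can be chosen from the cumulative proportion of variance explained. A self-contained sketch on R's built-in mtcars data (not the moneyball data):

```r
# Standardize and run PCA on a built-in dataset for illustration
pca <- prcomp(mtcars, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)

# Smallest number of components explaining at least 90% of the variance
k <- which(cum_var >= 0.90)[1]
k
```

The same calculation on pca_model would justify how many PCs to feed into the regression instead of fixing the count at eight.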
MODEL 3 Development
set.seed(123)
# Create an 80/20 partition index (note: the model below is trained on the
# full cleaned training set; the separate evaluation set serves as test data)
trainIndex <- createDataPartition(money_ball$TARGET_WINS, p = 0.8,
list = FALSE)
train_data <- money_ball
test_data <- money_ball_eval
control <- trainControl(method = "cv", number = 5) # 5-fold cross-validation
#Train the model (linear regression in this case)
model3 <- train(TARGET_WINS ~ .,
data = train_data,
method = "lm",
trControl = control)
# Summary of the cross-validation results
print(model3)
## Linear Regression
##
## 1571 samples
## 16 predictor
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 1256, 1257, 1257, 1257, 1257
## Resampling results:
##
## RMSE Rsquared MAE
## 10.14872 0.3785483 8.164485
##
## Tuning parameter 'intercept' was held constant at a value of TRUE
print(model3$results)
## intercept RMSE Rsquared MAE RMSESD RsquaredSD MAESD
## 1 TRUE 10.14872 0.3785483 8.164485 0.4272108 0.05484039 0.351182
mse3 <- mean(residuals(model3$finalModel)^2) # model3$residuals is NULL; the fitted lm lives in $finalModel
r_squared3 <- summary(model3)$r.squared
f_stat3 <- summary(model3)$fstatistic[1]
print(paste("MSE:", mse3, "R-squared:", r_squared3, "F-statistic:", f_stat3))
## [1] "MSE: NaN R-squared: 0.387355522736285 F-statistic: 61.409033365965"
predictions3 <- predict(model3, newdata = money_ball_eval)
head(predictions3)
## 1 2 3 4 5 6
## 60.45857 67.64519 72.18260 82.13139 138.20362 78.96059
This linear regression model performs reasonably well, explaining about 37% of the variance in the target variable and providing a moderate level of accuracy. While the error metrics (RMSE, MAE) indicate that the model is making reasonable predictions, the relatively low R-squared suggests that there is room for improvement, possibly by adding more predictors or using more sophisticated models that can capture non-linear patterns.
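For intuition, the 5-fold procedure that caret automates can be written out by hand in base R. A sketch on the built-in mtcars data (predicting mpg), not the moneyball model itself:

```r
set.seed(123)
k <- 5
df <- mtcars
# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(df)))

rmse_per_fold <- sapply(1:k, function(i) {
  train <- df[folds != i, ]   # fit on k-1 folds
  test  <- df[folds == i, ]   # evaluate on the held-out fold
  fit   <- lm(mpg ~ ., data = train)
  preds <- predict(fit, newdata = test)
  sqrt(mean((test$mpg - preds)^2))
})

mean(rmse_per_fold)  # cross-validated RMSE
```

This is essentially what trainControl(method = "cv", number = 5) does behind the scenes, with the fold-wise values corresponding to model3$resample.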
# Summary of the cross-validation results
model3$resample
## RMSE Rsquared MAE Resample
## 1 9.805383 0.3973287 7.964758 Fold1
## 2 10.096924 0.4163205 7.976654 Fold2
## 3 9.698260 0.4102635 7.803420 Fold3
## 4 10.728104 0.2826990 8.600262 Fold4
## 5 10.414940 0.3861299 8.477332 Fold5
library(ggplot2)
library(reshape2) # for melt function
# Use the per-fold metrics from the cross-validation itself rather than
# re-typing them by hand
data <- model3$resample
# Melt data for ggplot
data_melted <- melt(data, id.vars = 'Resample')
# Plot
ggplot(data_melted, aes(x = Resample, y = value, color = variable, group = variable)) +
geom_line() +
geom_point() +
labs(y = "Metric Value", title = "Metrics Across Folds") +
theme_minimal()
# Faceted plot
ggplot(data_melted, aes(x = Resample, y = value)) +
geom_bar(stat = 'identity') +
facet_wrap(~variable, scales = 'free_y') +
labs(y = "Metric Value", title = "Metrics Across Folds") +
theme_minimal()
The model’s performance varies slightly across folds: Fold3 posts the lowest RMSE and MAE and Fold2 the highest R-squared, while Fold4 is the weakest fold, with the highest RMSE and MAE and the lowest R-squared. This variation suggests the model is reasonably consistent across folds, but further tuning could improve generalization.
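Rather than eyeballing the fold table, the stability of the cross-validation can be quantified with the mean, standard deviation, and coefficient of variation of each metric. A minimal, self-contained sketch using the per-fold values printed above (`folds` stands in for `model3$resample`):

```r
# Fold-level metrics copied from the model3$resample output above
folds <- data.frame(
  RMSE     = c(9.805383, 10.096924, 9.698260, 10.728104, 10.414940),
  Rsquared = c(0.3973287, 0.4163205, 0.4102635, 0.2826990, 0.3861299),
  MAE      = c(7.964758, 7.976654, 7.803420, 8.600262, 8.477332)
)
# Per-metric mean and SD; a large SD relative to the mean signals unstable folds
round(sapply(folds, mean), 4)
round(sapply(folds, sd), 4)
# Coefficient of variation: SD as a fraction of the mean
round(sapply(folds, sd) / sapply(folds, mean), 3)
```

The mean RMSE reproduces the 10.1487 that caret reports, since the headline resampling result is just the average over folds.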
Based on the comparison of R-squared, RMSE/MSE, and F-statistic, Model 3 appears to be the best model overall. It has the highest R-squared (0.38), meaning it explains more variance, and its cross-validated RMSE (10.15) is competitive. While Model 1 has a slightly better MSE and a higher F-statistic, Model 3’s R-squared advantage makes it the better choice for capturing the relationship between the variables.
# Log transform on selected variables, excluding TEAM_BATTING_HBP
money_ball_log <- money_ball %>%
mutate(
log_TEAM_BATTING_H = log(TEAM_BATTING_H + 1),
log_TEAM_BATTING_2B = log(TEAM_BATTING_2B + 1),
log_TEAM_BATTING_3B = log(TEAM_BATTING_3B + 1),
log_TEAM_BATTING_HR = log(TEAM_BATTING_HR + 1),
log_TEAM_BATTING_BB = log(TEAM_BATTING_BB + 1),
log_TEAM_PITCHING_H = log(TEAM_PITCHING_H + 1),
log_TEAM_PITCHING_BB = log(TEAM_PITCHING_BB + 1),
log_TEAM_FIELDING_E = log(TEAM_FIELDING_E + 1)
)
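The `+ 1` inside each `log()` guards against `log(0)` for counts that can be zero; the transform itself is meant to pull in long right tails. A quick self-contained illustration on synthetic right-skewed counts (using a hand-rolled skewness helper, since no skewness function is loaded above):

```r
# Simple moment-based skewness estimator
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

set.seed(1)
# Synthetic right-skewed counts (Poisson draws with log-normal rates)
counts <- rpois(2000, lambda = exp(rnorm(2000, mean = 4, sd = 0.6)))
skewness(counts)           # clearly positive: a long right tail
skewness(log(counts + 1))  # much closer to zero after the transform
```

The same reasoning motivates the square-root transforms tried next; both compress large values more than small ones, just to different degrees.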
# Fit a linear regression model using log-transformed variables
model_log <- lm(TARGET_WINS ~ log_TEAM_BATTING_H + log_TEAM_BATTING_2B +
log_TEAM_BATTING_3B + log_TEAM_BATTING_HR +
log_TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_BASERUN_CS +
log_TEAM_PITCHING_H + log_TEAM_PITCHING_BB + TEAM_PITCHING_HR +
TEAM_PITCHING_SO + log_TEAM_FIELDING_E + TEAM_FIELDING_DP,
data = money_ball_log)
# Model summary
# summary(model_log)
# Square root transform on selected variables, excluding TEAM_BATTING_HBP
money_ball_sqrt <- money_ball %>%
mutate(
sqrt_TEAM_BATTING_H = sqrt(TEAM_BATTING_H),
sqrt_TEAM_BATTING_2B = sqrt(TEAM_BATTING_2B),
sqrt_TEAM_BATTING_3B = sqrt(TEAM_BATTING_3B),
sqrt_TEAM_BATTING_HR = sqrt(TEAM_BATTING_HR),
sqrt_TEAM_BATTING_BB = sqrt(TEAM_BATTING_BB),
sqrt_TEAM_PITCHING_H = sqrt(TEAM_PITCHING_H),
sqrt_TEAM_PITCHING_BB = sqrt(TEAM_PITCHING_BB),
sqrt_TEAM_FIELDING_E = sqrt(TEAM_FIELDING_E)
)
# Fit a linear regression model using square root-transformed variables
model_sqrt <- lm(TARGET_WINS ~ sqrt_TEAM_BATTING_H + sqrt_TEAM_BATTING_2B +
sqrt_TEAM_BATTING_3B + sqrt_TEAM_BATTING_HR +
sqrt_TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_BASERUN_CS +
sqrt_TEAM_PITCHING_H + sqrt_TEAM_PITCHING_BB + TEAM_PITCHING_HR +
TEAM_PITCHING_SO + sqrt_TEAM_FIELDING_E + TEAM_FIELDING_DP,
data = money_ball_sqrt)
# Model summary
# summary(model_sqrt)
# Interaction terms between batting and pitching stats
money_ball_interaction_outlier <- money_ball %>%
mutate(
interaction_HR_BB = TEAM_BATTING_HR * TEAM_BATTING_BB,
interaction_Pitching_HR_BB = TEAM_PITCHING_HR * TEAM_PITCHING_BB,
interaction_Fielding_E_DP = TEAM_FIELDING_E * TEAM_FIELDING_DP
)
# Fit a linear regression model with interaction terms
model_interaction_outlier <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_BASERUN_CS +
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
TEAM_FIELDING_E + TEAM_FIELDING_DP +
interaction_HR_BB + interaction_Pitching_HR_BB + interaction_Fielding_E_DP,
data = money_ball_interaction_outlier)
# Model summary
# summary(model_interaction_outlier)
# Function to remove outliers based on IQR method
remove_outliers <- function(df, cols) {
for (col in cols) {
Q1 <- quantile(df[[col]], 0.25, na.rm = TRUE)
Q3 <- quantile(df[[col]], 0.75, na.rm = TRUE)
IQR_value <- Q3 - Q1
lower_bound <- Q1 - 1.5 * IQR_value
upper_bound <- Q3 + 1.5 * IQR_value
df <- df[df[[col]] >= lower_bound & df[[col]] <= upper_bound, ]
}
return(df)
}
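As a sanity check on how aggressive the 1.5 × IQR filter is, it helps to compare row counts before and after filtering. A small self-contained sketch on synthetic data (the real call on `money_ball` follows below):

```r
set.seed(42)
# 98 typical values plus two planted extreme outliers
toy <- data.frame(x = c(rnorm(98), 50, -50))
Q1 <- quantile(toy$x, 0.25, na.rm = TRUE)
Q3 <- quantile(toy$x, 0.75, na.rm = TRUE)
iqr <- Q3 - Q1
kept <- toy[toy$x >= Q1 - 1.5 * iqr & toy$x <= Q3 + 1.5 * iqr, , drop = FALSE]
nrow(toy)   # 100
nrow(kept)  # fewer rows: both planted outliers (and any borderline points) drop out
```

Note that filtering column by column, as `remove_outliers` does, can compound: each pass shrinks the data frame, so the order of columns can slightly change which rows survive.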
# Remove outliers from relevant columns, excluding TEAM_BATTING_HBP
money_ball_no_outliers <- remove_outliers(money_ball,
c("TEAM_BATTING_H", "TEAM_BATTING_HR", "TEAM_BATTING_BB",
"TEAM_PITCHING_H", "TEAM_PITCHING_HR", "TEAM_PITCHING_BB",
"TEAM_FIELDING_E"))
# Fit a linear regression model without outliers
model_no_outliers <- lm(TARGET_WINS ~ TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
TEAM_BATTING_HR + TEAM_BATTING_BB + TEAM_BATTING_SO +
TEAM_BASERUN_SB + TEAM_BASERUN_CS +
TEAM_PITCHING_H + TEAM_PITCHING_HR + TEAM_PITCHING_BB + TEAM_PITCHING_SO +
TEAM_FIELDING_E + TEAM_FIELDING_DP,
data = money_ball_no_outliers)
# Model summary
# summary(model_no_outliers)
# Remove outliers based on IQR for key variables
money_ball_clean <- money_ball %>%
filter(
between(TEAM_BATTING_HR, quantile(TEAM_BATTING_HR, 0.25) - 1.5*IQR(TEAM_BATTING_HR), quantile(TEAM_BATTING_HR, 0.75) + 1.5*IQR(TEAM_BATTING_HR)),
between(TEAM_PITCHING_BB, quantile(TEAM_PITCHING_BB, 0.25) - 1.5*IQR(TEAM_PITCHING_BB), quantile(TEAM_PITCHING_BB, 0.75) + 1.5*IQR(TEAM_PITCHING_BB))
)
# Create interaction terms for batting and pitching stats
money_ball_interaction_no_outlier <- money_ball_clean %>%
mutate(
interaction_BATTING = TEAM_BATTING_HR * TEAM_BATTING_BB,
interaction_PITCHING = TEAM_PITCHING_H * TEAM_PITCHING_BB
)
# Fit a linear regression model with interaction terms
model_interaction_no_outlier <- lm(TARGET_WINS ~ interaction_BATTING + interaction_PITCHING +
TEAM_BATTING_H + TEAM_BATTING_2B + TEAM_BATTING_3B +
TEAM_BATTING_SO + TEAM_BASERUN_SB + TEAM_BASERUN_CS +
TEAM_PITCHING_HR + TEAM_PITCHING_SO +
TEAM_FIELDING_DP + TEAM_FIELDING_E,
data = money_ball_interaction_no_outlier)
# Model summary
# summary(model_interaction_no_outlier)
## Model Evaluation Function
evaluate_model <- function(model, data) {
# Predicted values
predictions <- predict(model, data)
# Actual values (target variable)
actuals <- data$TARGET_WINS
# Calculate residuals
residuals <- actuals - predictions
# Calculate metrics
r_squared <- summary(model)$r.squared
adj_r_squared <- summary(model)$adj.r.squared
mse <- mean(residuals^2)
rmse <- sqrt(mse)
f_statistic <- summary(model)$fstatistic[1]
# Print evaluation metrics
cat("Model Evaluation Metrics:\n")
cat("R-squared: ", r_squared, "\n")
cat("Adjusted R-squared: ", adj_r_squared, "\n")
cat("MSE: ", mse, "\n")
cat("RMSE: ", rmse, "\n")
cat("F-statistic: ", f_statistic, "\n")
# Generate diagnostic plots for residual analysis
par(mfrow = c(2, 2))
plot(model)
# Return the evaluation metrics in a list
return(list(
R_squared = r_squared,
Adjusted_R_squared = adj_r_squared,
MSE = mse,
RMSE = rmse,
F_statistic = f_statistic
))
}
evaluation_log <- evaluate_model(model_log, money_ball_log)
## Model Evaluation Metrics:
## R-squared: 0.3728454
## Adjusted R-squared: 0.3672026
## MSE: 103.5444
## RMSE: 10.17567
## F-statistic: 66.07477
evaluation_sqrt <- evaluate_model(model_sqrt, money_ball_sqrt)
## Model Evaluation Metrics:
## R-squared: 0.3819698
## Adjusted R-squared: 0.3764091
## MSE: 102.0379
## RMSE: 10.10138
## F-statistic: 68.69117
evaluation_Interaction_outlier <- evaluate_model(model_interaction_outlier, money_ball_interaction_outlier)
## Model Evaluation Metrics:
## R-squared: 0.3924433
## Adjusted R-squared: 0.3857927
## MSE: 100.3087
## RMSE: 10.01542
## F-statistic: 59.00825
evaluation_no_outliers <- evaluate_model(model_no_outliers, money_ball_no_outliers)
## Model Evaluation Metrics:
## R-squared: 0.3857453
## Adjusted R-squared: 0.3800084
## MSE: 99.49619
## RMSE: 9.974777
## F-statistic: 67.23969
evaluation_interaction_no_outlier <- evaluate_model(model_interaction_no_outlier, money_ball_interaction_no_outlier)
## Model Evaluation Metrics:
## R-squared: 0.3843853
## Adjusted R-squared: 0.3796316
## MSE: 101.6566
## RMSE: 10.08249
## F-statistic: 80.85886
Based on the models we have analyzed, the following variables have coefficients that make sense:
TEAM_BATTING_H (Hits): Positive coefficients across models indicate that more hits lead to more wins, which is expected because hits create scoring opportunities.
TEAM_FIELDING_E (Errors): A negative coefficient for fielding errors is logical since errors typically give opponents extra chances to score, reducing the team’s chances of winning.
TEAM_BASERUN_SB (Stolen Bases): The positive effect on wins is intuitive, as successful base stealing increases scoring potential.
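These sign checks can be automated rather than read off each summary by hand. A minimal sketch on synthetic data (`wins`, `hits`, `errors`, and `steals` here are simulated stand-ins, not the real Moneyball columns), flagging any coefficient whose sign contradicts the domain expectation:

```r
set.seed(7)
n <- 300
# Simulated predictors on roughly baseball-like scales
hits   <- rnorm(n, 1400, 100)
errors <- rnorm(n, 150, 30)
steals <- rnorm(n, 80, 20)
# True relationship: hits and steals help, errors hurt
wins <- 20 + 0.04 * hits - 0.05 * errors + 0.06 * steals + rnorm(n, 0, 5)

fit <- lm(wins ~ hits + errors + steals)
expected_sign <- c(hits = 1, errors = -1, steals = 1)  # domain expectations
actual_sign   <- sign(coef(fit)[names(expected_sign)])
actual_sign == expected_sign                           # all TRUE when signs agree
```

Applied to the fitted models above, the same pattern (a named vector of expected signs compared against `sign(coef(model))`) makes it easy to spot sign flips that often indicate multicollinearity between batting and pitching variables.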
Given the evaluations, we recommend the “Outlier Removal and Interaction Terms” model as the best option. This model has an R-squared of 0.3844 and an Adjusted R-squared of 0.3796, indicating a reasonable level of variance explanation. While its Mean Squared Error (MSE) is 101.66 and its Root Mean Squared Error (RMSE) is 10.08, the model demonstrates solid predictive capability. Notably, it has the highest F-statistic, 80.86, suggesting it explains the variance in the outcome efficiently relative to its number of predictors. Although the “Interaction Terms” model has the highest R-squared (0.3924) and Adjusted R-squared (0.3858), its added complexity may lead to overfitting and makes it harder to interpret. Therefore, the “Outlier Removal and Interaction Terms” model effectively balances performance and simplicity, providing more reliable insights for decision-making.
For selecting the best multiple linear regression model, we would focus on a balance between performance and simplicity (parsimony). Let’s analyze the “Outlier Removal and Interaction Terms” model and the “Interaction Terms” model based on each criterion:
Performance vs. Parsimony: The “Outlier Removal and Interaction Terms” model has an R-squared of 0.3844 and an Adjusted R-squared of 0.3796, indicating it explains a meaningful portion of the variance. The “Interaction Terms” model has a slightly higher R-squared (0.3924), but its additional predictors and interaction terms complicate interpretation.
Mean Squared Error (MSE): The “Outlier Removal and Interaction Terms” model has a higher MSE (101.66) than the “Interaction Terms” model (100.31), but its predictive capability remains robust despite this small difference.
F-statistic: The “Outlier Removal and Interaction Terms” model has an F-statistic of 80.86, well above the “Interaction Terms” model’s 59.01. Since the F-statistic accounts for the number of predictors, this suggests the simpler model explains the variance in the outcome more efficiently per parameter.
In conclusion, while both models have their merits, the “Outlier Removal and Interaction Terms” model effectively balances performance and interpretability, making it the preferred choice for reliable insights.